display(HTML(sales_dist_div))
display(HTML(order_dist_div))
display(HTML(spo_dist_div))Note: This report showcases the analyses performed for the cafe, but prepared with anonymized and modified data for demonstration purposes.
Table of contents
Exploratory data analysis
1.1 Revenue decomposition by day of the week
1.2 Revenue decomposition by type of day and season
1.3 Trends
1.4 Hot Chocolate Festival (HCF) impact on salesRegression analysis - HCF
2.1 HCF impact - Item A
2.2 HCF impact - Item BRegression analysis - individual item sales
3.1 Understanding Item A sales
3.2 Understanding Item B sales
3.3 Understanding Item C salesAppendix 1 - Statistical tests on recommendations from section 1.2
Appendix 2 - Regression models - multicollinearity checks
Appendix 3 - Regression models - residual plots
1. Exploratory data analysis
1.1 Revenue decomposition by day of the week
Analysis
Weekends show the highest overall demand, but also exhibit the greatest fluctuations in sales (greater spread in boxplots).
Differences in sales between weekends and weekdays appear to be driven by variations in customer traffic, as the distribution of sales per order remains consistent across the week.
Weekdays demand is relatively stable, with overlapping sales distributions across Mondays to Thursdays.
Sales on holidays are comparable to Fridays, suggesting strong demand.
Wednesday holidays (e.g., Christmas Day and New Year’s Day) saw higher sales per order but fewer total orders, possibly reflecting larger group purchases or celebrations.
1.2 Revenue decomposition by type of day and season
rev_decomp_fig.show()Analysis
Summer revenue was driven by a high number of orders, however sales per order remained comparable to other the seasons, suggesting strong customer traffic rather than higher spending per visit.
Fall and Winter saw similar customer traffic, but Winter had higher average sales per order, possibly due to holiday-related purchases.
In Spring, the number of orders increased notably compared to Winter, though sales per order remained relatively stable, suggesting customer traffic starts to pick up in Spring, but customers are not necessarily spending more per visit.
Across all seasons except Summer, holidays experienced higher average sales per order when compared to weekends and Fridays.
Weekends consistently recorded higher sales per order than Fridays, regardless of season. This emphasizes the importance of the weekend traffic.
Recommendations
Spring and Summer show a pick up in customer traffic, consider launching new items during these seasons to capitalize on the higher traffic.
As the propensity to spend appear to be higher on holidays, it offers an opportunity to offer limited-time combo’s or promotions to further boost revenue.
Please refer to Appendix 1 for the statistical tests performed in support of the recommendations.
1.3 Trends
weekly_trend.show()
sales_spo.show()Analysis
Revenue during slow period (much of fall and winter) was dragged down by lower foot traffic, though those who visit spend more compared to the summer months.
The period where item A sales overtook item B (late October to late February), coincides with stronger sales per order but fewer total orders. This suggests:
- The business lost a portion of customers who buys item B (lower dollar amount) during slow periods;
- Item A customers are more stable throughout the year and spend more per transaction, helping partially offset reduced customer traffic.
Hot Chocolate Festival (HCF) did not appear to have significantly driven sales in other menu items, indicating that festival-driven visits is more product-specific rather than creating cross-sell opportunities.
Recommendations
- To improve revenue during the slow period, need to find ways to boost item A sales. E.g. introducing seasonal flavours, or combo with hot drinks etc.
1.4 Hot Chocolate Festival (HCF) impact on sales
As HCF happened during the Winter, we’ll compare sales during the HCF days and non-HCF days in the Winter only for a more accurate comparison.
HCF_A_fig.show()
HCF_B_fig.show()Analysis
HCF did not appear to have a strong effect on item A sales. Most days exhibited overlapping distributions, with Saturdays being the only exception.
HCF seems to have suppresssed item B sales, with three of the days exhibiting notably lower sales.
To supplement the graphical analysis above, we’ll perform a regression analysis.
2. Regression analysis - Hot Chocolate Festival (HCF)
2.1 HCF impact - Item A
Here we attempt to explain daily item A sales during the Winter using the following factors:
rain: Amount of rainfall (mm) in Metro Vancouver. This models the effect of rainfall on item A sales.snow: Amount of snowfall (mm) in Metro Vancouver. This models the effect of snowfall on item A sales.is_long_weekend: Represents whether the day is part of a long weekend. This captures the effect of long weekends on item A sales.is_holiday: Represents whether the day is a holiday. This captures the effect of holidays on item A sales.day_of_week: Represents the day of the week (Monday, Tuesday etc.). This captures the weekly seasonality of the business.is_HCF: Represents whether the day is a Hot Chocolate Festival day. This captures the effect of HCF on item A sales.
The variable of interest here is is HCF. If participating in the HCF does have a spillover effect on item A sales, is_HCF will have a positive coefficient that is statistically significant.
formula = (
"""
item_A_sales ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(day_of_week)
+ C(is_HCF)
"""
)
hcf_A_model = smf.ols(formula, data=hcf_A_df).fit()
print(hcf_A_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: item_A_sales R-squared: 0.796
Model: OLS Adj. R-squared: 0.767
Method: Least Squares F-statistic: 27.69
Date: Fri, 15 Aug 2025 Prob (F-statistic): 1.60e-22
Time: 13:57:59 Log-Likelihood: -601.40
No. Observations: 90 AIC: 1227.
Df Residuals: 78 BIC: 1257.
Df Model: 11
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1715.8404 61.974 27.687 0.000 1592.460 1839.221
is_long_weekend[T.True] 111.4486 132.132 0.843 0.402 -151.606 374.503
C(is_holiday)[T.True] 815.1812 121.936 6.685 0.000 572.426 1057.937
C(day_of_week)[T.Tuesday] -80.4367 82.758 -0.972 0.334 -245.195 84.322
C(day_of_week)[T.Wednesday] 74.6287 83.686 0.892 0.375 -91.978 241.235
C(day_of_week)[T.Thursday] 22.3063 83.004 0.269 0.789 -142.941 187.554
C(day_of_week)[T.Friday] 449.5122 84.417 5.325 0.000 281.451 617.573
C(day_of_week)[T.Saturday] 809.8290 85.409 9.482 0.000 639.792 979.866
C(day_of_week)[T.Sunday] 841.2587 82.239 10.229 0.000 677.533 1004.984
C(is_HCF)[T.True] 73.7003 52.359 1.408 0.163 -30.539 177.939
rain -2.7682 4.069 -0.680 0.498 -10.868 5.332
snow -34.8685 23.917 -1.458 0.149 -82.484 12.747
==============================================================================
Omnibus: 19.947 Durbin-Watson: 2.254
Prob(Omnibus): 0.000 Jarque-Bera (JB): 27.493
Skew: 1.016 Prob(JB): 1.07e-06
Kurtosis: 4.790 Cond. No. 53.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
The coefficient for
is_HCFis not statistically different from 0, indicating there is no strong evidence for a spillover effect on item A from participating in the HCF, after controlling for other variables. The positive coefficient could be interpreted as weak evidence that HCF did drive some item A sales, but not by a significant amount.Item A sales during the Winter was still primarily explained by the weekly seasonality and holiday effects.
2.2 HCF impact - Item B
We will be following the same modelling approach from above for HCF’s impact on item B.
formula = (
"""
item_B_sales ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(day_of_week)
+ C(is_HCF)
"""
)
hcf_B_model = smf.ols(formula, data=hcf_b_df).fit()
print(hcf_B_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: item_B_sales R-squared: 0.748
Model: OLS Adj. R-squared: 0.712
Method: Least Squares F-statistic: 21.05
Date: Fri, 15 Aug 2025 Prob (F-statistic): 4.76e-19
Time: 13:57:59 Log-Likelihood: -604.63
No. Observations: 90 AIC: 1233.
Df Residuals: 78 BIC: 1263.
Df Model: 11
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 1663.6781 64.236 25.900 0.000 1535.795 1791.562
is_long_weekend[T.True] 424.4739 136.954 3.099 0.003 151.819 697.129
C(is_holiday)[T.True] 413.8744 126.386 3.275 0.002 162.259 665.490
C(day_of_week)[T.Tuesday] -142.9222 85.779 -1.666 0.100 -313.694 27.850
C(day_of_week)[T.Wednesday] -131.4728 86.741 -1.516 0.134 -304.160 41.215
C(day_of_week)[T.Thursday] -50.0111 86.033 -0.581 0.563 -221.290 121.268
C(day_of_week)[T.Friday] 422.4017 87.498 4.828 0.000 248.207 596.597
C(day_of_week)[T.Saturday] 676.2040 88.527 7.638 0.000 499.961 852.447
C(day_of_week)[T.Sunday] 528.3789 85.241 6.199 0.000 358.678 698.080
C(is_HCF)[T.True] -24.7872 54.270 -0.457 0.649 -132.831 83.256
rain -0.6680 4.217 -0.158 0.875 -9.064 7.728
snow -60.8765 24.790 -2.456 0.016 -110.230 -11.523
==============================================================================
Omnibus: 0.065 Durbin-Watson: 1.576
Prob(Omnibus): 0.968 Jarque-Bera (JB): 0.235
Skew: -0.009 Prob(JB): 0.889
Kurtosis: 2.750 Cond. No. 53.9
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Similar to item A, the coefficient for
is_HCFis not statistically different from 0, indicating there is no strong evidence that item B sales during HCF differed from sales during non-HCF periods, after controlling for other variables.Item B sales during the Winter were also primarily explained by the weekly seasonality and holiday effects, but compared to item A, long weekends and weather conditions also contributed.
3. Regression analysis - individual item sales
3.1 Understanding Item A sales
Here we attempt to explain daily item A sales on a broader scale with the following factors:
rain: Amount of rainfall (mm) in Metro Vancouver. This models the effect of rainfall on item A sales.snow: Amount of snowfall (mm) in Metro Vancouver. This models the effect of snowfall on item A sales.is_long_weekend: Represents whether the day is part of a long weekend. This captures the effect of long weekends on item A sales.is_holiday: Represents whether the day is a holiday. This captures the effect of holidays on item A sales.season: Represents the season (Summer, Fall etc.). This captures the seasonal effects of the business.day_of_week: Represents the day of the week (Monday, Tuesday etc.). This captures the weekly seasonality of the business.
formula = (
"""
log_A ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
A_model = smf.ols(formula, data=A_sales_df).fit()
print(A_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_A R-squared: 0.766
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 81.75
Date: Fri, 15 Aug 2025 Prob (F-statistic): 4.71e-94
Time: 13:57:59 Log-Likelihood: 298.17
No. Observations: 339 AIC: -568.3
Df Residuals: 325 BIC: -514.8
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.4547 0.019 397.498 0.000 7.418 7.492
is_long_weekend[T.True] 0.0040 0.026 0.154 0.878 -0.048 0.056
C(is_holiday)[T.True] 0.3098 0.035 8.771 0.000 0.240 0.379
C(season)[T.Spring] 0.0927 0.017 5.555 0.000 0.060 0.125
C(season)[T.Summer] 0.1597 0.015 10.317 0.000 0.129 0.190
C(season)[T.Fall] -0.0206 0.016 -1.323 0.187 -0.051 0.010
C(day_of_week)[T.Tuesday] 0.0098 0.022 0.452 0.651 -0.033 0.053
C(day_of_week)[T.Wednesday] 0.0330 0.022 1.525 0.128 -0.010 0.075
C(day_of_week)[T.Thursday] 0.0152 0.022 0.705 0.481 -0.027 0.058
C(day_of_week)[T.Friday] 0.2339 0.022 10.733 0.000 0.191 0.277
C(day_of_week)[T.Saturday] 0.4041 0.022 18.273 0.000 0.361 0.448
C(day_of_week)[T.Sunday] 0.3536 0.022 16.108 0.000 0.310 0.397
rain -0.0011 0.001 -1.365 0.173 -0.003 0.001
snow -0.0121 0.010 -1.176 0.241 -0.032 0.008
==============================================================================
Omnibus: 25.337 Durbin-Watson: 1.237
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38.735
Skew: 0.509 Prob(JB): 3.88e-09
Kurtosis: 4.306 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Item A sales are not strongly correlated with weather conditions (rain or snow), indicating that customer behavior is relatively unaffected by these factors.
Variations in item A sales are primarily driven by seasonality — both in terms of the day of the week and the actual season — as well as whether the day is a holiday.
Model intepretation
To improve model performance, a log transformation was applied to item A sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.
exp_coef| coeff | p-value | |
|---|---|---|
| features | ||
| Intercept | 1727.95 | 0.000 |
| is_long_weekend[T.True] | 1.00 | 0.878 |
| C(is_holiday)[T.True] | 1.36 | 0.000 |
| C(season)[T.Spring] | 1.10 | 0.000 |
| C(season)[T.Summer] | 1.17 | 0.000 |
| C(season)[T.Fall] | 0.98 | 0.187 |
| C(day_of_week)[T.Tuesday] | 1.01 | 0.651 |
| C(day_of_week)[T.Wednesday] | 1.03 | 0.128 |
| C(day_of_week)[T.Thursday] | 1.02 | 0.481 |
| C(day_of_week)[T.Friday] | 1.26 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.50 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.42 | 0.000 |
| rain | 1.00 | 0.173 |
| snow | 0.99 | 0.241 |
Base case: Average item A sales = $1727.95 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - No significant impact on sales.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 36% higher than the base case.
season - No significant difference in item A sales between Winter and Fall.
- Compared to Winter, and all else being equal:
- Spring sales are 10% higher
- Summer sales are 17% higher
day_of_week - No significant difference in item A sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 26% higher
- Saturday sales are 50% higher
- Sunday sales are 42% higher
rain - No significant impact on sales.
snow - No significant impact on sales.
For example:
Average item A sales on a Friday in the Summer with no rain and no snow = $1727.95 x 1.26 x 1.17 = $2547.34
3.2 Understanding Item B sales
We will be modelling overall item B sales with the same approach as above.
formula = (
"""
log_B ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
B_model = smf.ols(formula, data=B_sales_df).fit()
print(B_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_B R-squared: 0.850
Model: OLS Adj. R-squared: 0.844
Method: Least Squares F-statistic: 142.2
Date: Fri, 15 Aug 2025 Prob (F-statistic): 1.87e-125
Time: 13:57:59 Log-Likelihood: 261.37
No. Observations: 339 AIC: -494.7
Df Residuals: 325 BIC: -441.2
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.3488 0.021 351.536 0.000 7.308 7.390
is_long_weekend[T.True] 0.0995 0.029 3.404 0.001 0.042 0.157
C(is_holiday)[T.True] 0.2682 0.039 6.812 0.000 0.191 0.346
C(season)[T.Spring] 0.1558 0.019 8.380 0.000 0.119 0.192
C(season)[T.Summer] 0.4376 0.017 25.364 0.000 0.404 0.472
C(season)[T.Fall] 0.0478 0.017 2.751 0.006 0.014 0.082
C(day_of_week)[T.Tuesday] -0.0073 0.024 -0.299 0.765 -0.055 0.040
C(day_of_week)[T.Wednesday] 0.0294 0.024 1.219 0.224 -0.018 0.077
C(day_of_week)[T.Thursday] 0.0431 0.024 1.789 0.075 -0.004 0.090
C(day_of_week)[T.Friday] 0.3129 0.024 12.881 0.000 0.265 0.361
C(day_of_week)[T.Saturday] 0.4398 0.025 17.840 0.000 0.391 0.488
C(day_of_week)[T.Sunday] 0.3846 0.024 15.718 0.000 0.336 0.433
rain -0.0063 0.001 -6.772 0.000 -0.008 -0.004
snow -0.0348 0.011 -3.044 0.003 -0.057 -0.012
==============================================================================
Omnibus: 5.425 Durbin-Watson: 1.387
Prob(Omnibus): 0.066 Jarque-Bera (JB): 6.803
Skew: -0.122 Prob(JB): 0.0333
Kurtosis: 3.650 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Analysis
Item B sales is particularly weak during the Winter.
Item B sales are more sensitive to weather conditions, weekly seasonality and long weekends.
Model interpretation
To improve model performance, a log transformation was applied to item B sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.
exp_coef| coeff | p-value | |
|---|---|---|
| features | ||
| Intercept | 1554.34 | 0.000 |
| is_long_weekend[T.True] | 1.10 | 0.001 |
| C(is_holiday)[T.True] | 1.31 | 0.000 |
| C(season)[T.Spring] | 1.17 | 0.000 |
| C(season)[T.Summer] | 1.55 | 0.000 |
| C(season)[T.Fall] | 1.05 | 0.006 |
| C(day_of_week)[T.Tuesday] | 0.99 | 0.765 |
| C(day_of_week)[T.Wednesday] | 1.03 | 0.224 |
| C(day_of_week)[T.Thursday] | 1.04 | 0.075 |
| C(day_of_week)[T.Friday] | 1.37 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.55 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.47 | 0.000 |
| rain | 0.99 | 0.000 |
| snow | 0.97 | 0.003 |
Base case: Average item B sales = $1554.34 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - All else being equal, on average, sales are approximately 10% higher on long weekends compared to regular weekends.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 31% higher than the base case.
season - Compared to Winter, and all else being equal: - Spring sales are 17% higher - Summer sales are 55% higher - Fall sales are 5% higher
day_of_week - No significant difference in item B sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 37% higher
- Saturday sales are 55% higher
- Sunday sales are 47% higher
rain - Each additional 1mm of rainfall is associated with a 1% decrease in item B sales, all else being equal.
snow - Each additional 1mm of snowfall is associated with a 3% decrease in item B sales, all else being equal.
For example:
Average item B sales on a Friday in the Fall with 3mm of rain and no snow = $1554.34 x 1.37 x 1.05 x 0.99^3 = $2169.51
3.3 Understanding item C sales
We will be modelling overall item C sales with the same approach as above.
formula = (
"""
log_C ~
+ rain
+ snow
+ is_long_weekend
+ C(is_holiday)
+ C(season)
+ C(day_of_week)
"""
)
C_model = smf.ols(formula, data=C_sales_df).fit()
print(C_model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: log_C R-squared: 0.596
Model: OLS Adj. R-squared: 0.580
Method: Least Squares F-statistic: 36.93
Date: Fri, 15 Aug 2025 Prob (F-statistic): 3.16e-56
Time: 13:57:59 Log-Likelihood: 467.48
No. Observations: 339 AIC: -907.0
Df Residuals: 325 BIC: -853.4
Df Model: 13
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
Intercept 7.0359 0.011 618.186 0.000 7.014 7.058
is_long_weekend[T.True] 0.0142 0.016 0.889 0.375 -0.017 0.045
C(is_holiday)[T.True] 0.0968 0.021 4.518 0.000 0.055 0.139
C(season)[T.Spring] 0.0749 0.010 7.402 0.000 0.055 0.095
C(season)[T.Summer] 0.0904 0.009 9.623 0.000 0.072 0.109
C(season)[T.Fall] 0.0087 0.009 0.920 0.358 -0.010 0.027
C(day_of_week)[T.Tuesday] -0.0076 0.013 -0.572 0.568 -0.034 0.018
C(day_of_week)[T.Wednesday] 0.0129 0.013 0.981 0.327 -0.013 0.039
C(day_of_week)[T.Thursday] 0.0128 0.013 0.974 0.331 -0.013 0.039
C(day_of_week)[T.Friday] 0.0849 0.013 6.416 0.000 0.059 0.111
C(day_of_week)[T.Saturday] 0.1485 0.013 11.061 0.000 0.122 0.175
C(day_of_week)[T.Sunday] 0.1243 0.013 9.332 0.000 0.098 0.151
rain -0.0007 0.001 -1.286 0.199 -0.002 0.000
snow -0.0115 0.006 -1.842 0.066 -0.024 0.001
==============================================================================
Omnibus: 13.287 Durbin-Watson: 1.290
Prob(Omnibus): 0.001 Jarque-Bera (JB): 16.301
Skew: 0.360 Prob(JB): 0.000289
Kurtosis: 3.797 Cond. No. 64.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Model interpretation
To improve model performance, a log transformation was applied to item C sales. For easier interpretation, we will reverse this transformation by exponentiating the coefficients below.
exp_coef| coeff | p-values | |
|---|---|---|
| features | ||
| Intercept | 1136.75 | 0.000 |
| is_long_weekend[T.True] | 1.01 | 0.375 |
| C(is_holiday)[T.True] | 1.10 | 0.000 |
| C(season)[T.Spring] | 1.08 | 0.000 |
| C(season)[T.Summer] | 1.09 | 0.000 |
| C(season)[T.Fall] | 1.01 | 0.358 |
| C(day_of_week)[T.Tuesday] | 0.99 | 0.568 |
| C(day_of_week)[T.Wednesday] | 1.01 | 0.327 |
| C(day_of_week)[T.Thursday] | 1.01 | 0.331 |
| C(day_of_week)[T.Friday] | 1.09 | 0.000 |
| C(day_of_week)[T.Saturday] | 1.16 | 0.000 |
| C(day_of_week)[T.Sunday] | 1.13 | 0.000 |
| rain | 1.00 | 0.199 |
| snow | 0.99 | 0.066 |
Base case: Average item C sales = $1136.75 on a regular Monday in Winter with no rain and no snow.
is_long_weekend - No significant impact on sales.
is_holiday - All else being equal, on average, sales during on a holiday is roughly 10% higher than the base case.
season - No significant difference in item C sales between Winter and Fall.
- Compared to Winter, and all else being equal:
- Spring sales are 8% higher
- Summer sales are 9% higher
day_of_week - No significant difference in item C sales between the weekdays (Mon - Thurs).
- Relative to Monday, and all else being equal:
- Friday sales are 9% higher
- Saturday sales are 16% higher
- Sunday sales are 13% higher
rain - No significant impact on sales.
snow - No significant impact on sales.
For example:
Average item C sales on a Sunday in the Spring with no snow = $1136.75 x 1.13 x 1.08 = $1387.29
Note: It’s worth noting however that the adjusted R-squared of the regression model is lower compared to the item A and item B models. This suggests certain important variables that explain item C sales are missing. Further investigation is warranted.
Appendix 1 - Statistical tests on recommendations from section 1.2
Is the average number of orders higher during the Spring/Summer compared to the Fall/Winter?
order_warm = combined_df[(combined_df['season']=='Spring') | (combined_df['season']=='Summer')]['in_store_orders']
order_cold = combined_df[(combined_df['season']=='Fall') | (combined_df['season']=='Winter')]['in_store_orders']# test for equal variance, a ratio under 2 (or above 0.5) is considered acceptable
(order_warm.describe()['std'] / order_cold.describe()['std']).round(2)1.21
# perform two-sample t-test with equal variance
statistic, p_value = ttest_ind(order_warm, order_cold, alternative='greater')
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')Test statistic: 8.917
p-value: 0.000
There is statistical evidence suggesting the average number of orders is higher during the Spring/Summer compared to the Fall/Winter.
Is the average sales per order higher during holiday’s compared to Fridays and Weekends (non-Summer seasons)?
Note: We will focus our test on comparing holidays and weekends. Since weekends already exhibit higher average sales per order than Fridays, a significant difference between holidays and weekends would also imply holidays differ sigificantly from Fridays.
spo_holiday = combined_df[(combined_df['type_of_day']=='Holiday') & (combined_df['season']!='Summer')]['sales_per_order']
spo_weekend = combined_df[(combined_df['type_of_day']=='Weekend') & (combined_df['season']!='Summer')]['sales_per_order']# test for equal variance, a ratio under 2 is considered acceptable
(spo_holiday.describe()['std'] / spo_weekend.describe()['std']).round(2)1.31
# since our sample for holiday's is small, let's perform a Shapiro-Wilk test to check for normality
statistic, p_value = shapiro(spo_holiday)
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')
if p_value < 0.05:
print('The data is not normally distributed.')
else:
print('The data is normally distributed.')Test statistic: 0.896
p-value: 0.164
The data is normally distributed.
# perform two-sample t-test
statistic, p_value = ttest_ind(spo_holiday, spo_weekend, alternative='greater')
print(f'Test statistic: {statistic:.3f}')
print(f'p-value: {p_value:.3f}')Test statistic: 1.797
p-value: 0.038
There is statistical evidence suggesting the average sales per order is higher on holidays than weekends.
Appendix 2 - Regression models - multicollinearity checks
Multicollinearity checks assess whether two or more independent variables in a regression model are highly correlated with each other. When independent variables are strongly correlated, it becomes difficult for the model to determine their individual effects and may lead to unreliable results.
Here we will be checking with Variance Inflation Factor (VIF). As a general rule of thumb, a VIF under 5 indicates there is no concern of multicollinearity.
HCF item A / item B models
vif_df.round(3)| Features | VIF | |
|---|---|---|
| 0 | rain | 1.685 |
| 1 | snow | 1.172 |
| 2 | is_long_weekend_True | 1.185 |
| 3 | is_holiday_True | 1.375 |
| 4 | day_of_week_Tuesday | 1.169 |
| 5 | day_of_week_Wednesday | 1.242 |
| 6 | day_of_week_Thursday | 1.179 |
| 7 | day_of_week_Friday | 1.258 |
| 8 | day_of_week_Saturday | 1.259 |
| 9 | day_of_week_Sunday | 1.187 |
| 10 | is_HCF_True | 1.657 |
No signs of multicollinearity among the existing variables.
Overall Item A / item B / item C sales models
vif_df.round(3)| Features | VIF | |
|---|---|---|
| 0 | hours_opened | 17.308 |
| 1 | avg_temperature | 17.791 |
| 2 | rain | 1.357 |
| 3 | snow | 1.128 |
| 4 | HCF_sales | 1.878 |
| 5 | is_long_weekend_True | 1.644 |
| 6 | is_holiday_True | 1.543 |
| 7 | season_Spring | 2.403 |
| 8 | season_Summer | 7.138 |
| 9 | season_Fall | 3.488 |
| 10 | day_of_week_Tuesday | 2.138 |
| 11 | day_of_week_Wednesday | 2.160 |
| 12 | day_of_week_Thursday | 2.150 |
| 13 | day_of_week_Friday | 2.368 |
| 14 | day_of_week_Saturday | 2.468 |
| 15 | day_of_week_Sunday | 2.358 |
The results indicate there is multicollinearity between hours_opened, avg_temperature and season_Summer. Since hours_opened is largely explained by the day of the week and avg_temperature is correlated with seasonal effects, we will drop these two variables to reduce redundancy and improve model reliability.
vif_df.round(3)| Features | VIF | |
|---|---|---|
| 0 | rain | 1.327 |
| 1 | snow | 1.116 |
| 2 | HCF_sales | 1.364 |
| 3 | is_long_weekend_True | 1.634 |
| 4 | is_holiday_True | 1.467 |
| 5 | season_Spring | 1.647 |
| 6 | season_Summer | 1.854 |
| 7 | season_Fall | 1.912 |
| 8 | day_of_week_Tuesday | 1.440 |
| 9 | day_of_week_Wednesday | 1.459 |
| 10 | day_of_week_Thursday | 1.432 |
| 11 | day_of_week_Friday | 1.461 |
| 12 | day_of_week_Saturday | 1.612 |
| 13 | day_of_week_Sunday | 1.538 |
There are no longer signs of multicollinearity after dropping hours_opened and avg_temperature.
Appendix 3: Regression models - residual plots
We examine the residuals of our regression models to check for any discernible patterns. A random scatter of residuals, with no clear pattern, indicates that the model adequately captures the underlying relationships in the data and that the assumption of linearity between the target variable and independent variables is likely satisfied.
HCF - item A model
hcf_A_scatter.show()There is no discernable pattern in the residuals.
HCF - item B model
hcf_B_scatter.show()There is no discernable pattern in the residuals.
Item A model
A_scatter.show()There is no discernable pattern in the residuals.
Item B model
B_scatter.show()A couple bigger residuals for smaller fitted values, but overall no clear pattern.
Item C model
C_scatter.show()There are signs of heteroscedasticity, as the variance of the residuals increases with larger predicted values. This suggests potential model misspecification, possibly due to missing variables in the model. Further investigation is needed to identify and address the cause.